Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words



Related resources

Swahili Text-to-speech System

Text-to-speech (TTS) applications have been applied in diverse areas all over the world. Considering that Swahili pronunciation is not complicated, and that the language is spoken by about 45 to 100 million people as their first or second language, we considered the feasibility of, and developed, a Swahili Text-to-Speech (TTS) system. This paper gives an account of the Swahili TTS system developed...


A MODEL FOR EVOLUTIONARY DYNAMICS OF WORDS IN A LANGUAGE

Human language, over its evolutionary history, has emerged as one of the fundamental defining characteristics of modern man. However, this milestone evolutionary process through natural selection has not left any 'linguistic fossils' that may enable us to trace back the actual course of development of language and its establishment in human societies. Lacking analytical tools to fathom the cr...


Exploring text datasets by visualizing relevant words

When working with a new dataset, it is important to first explore and familiarize oneself with it, before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from other documents belonging to different categories. In th...


Web-based corpus acquisition for Swahili language modelling

Finding large amounts of text data for use in natural language technology is difficult for under-resourced languages such as Swahili. The corpora that are readily accessible for these languages are not sufficient to be used in language technologies, whose requirements can run into the hundreds of millions of words. This paper describes how we can take advantage of search engines such as Google ...


Competitive Intelligence Text Mining: Words Speak

Competitive intelligence (CI) has become one of the major subjects for researchers in recent years. The present research aims to achieve a part of CI by investigating the scientific articles in this field through text mining in three interrelated steps. In the first step, a total of 1143 articles released between 1987 and 2016 were selected by searching the phrase "competitive intellige...



Journal

Journal title: Data in Brief

Year: 2020

ISSN: 2352-3409

DOI: 10.1016/j.dib.2020.106517